In this project, I am going to be analyzing Baltimore Crime data, attempting to infer major factors for crime, including where they tend to occur with respect to geographic features.
According to neighborhoodscout.com, Baltimore has a "Total Crime Index" of 3, meaning that it is safer than only roughly 3% of neighborhoods across the United States. This is an absurd statistic; why is the most populous city in Maryland (and by a long shot, too) such a dangerous place to live? In order to cut down on crime, we must first understand why it is occurring so that we might address the underlying issues at their source. Obviously, "why does crime occur" is an immensely complex question which would require millions of dollars of research to answer concretely. However, what we can do is draw correlations between crime and different contributing factors, breaking down the overall problem into pieces that are more directly addressable. Armed with this knowledge, the police can more intelligently allocate their resources to areas with higher crime rates, and may be able to predict how city-level changes will affect crime in the future. For example, if we found that proximity to car dealerships had a significant effect on crime rates in the surrounding area, we could preemptively increase police presence around an upcoming dealership to deter crime before it occurs. Overall, by analyzing existing crime data, we may be able to draw conclusions that will reduce crime in the future.
Before we get into our code, we will determine a list of factors to look into. The source for each set of factors is listed alongside it below.
Crime deterrent effect of police stations by Fondevila et al. 3. proximity to police stations
The Association between Density of Alcohol Establishments and Violent Crime within Urban Neighborhoods by Toomey et al. 4. proximity to bars
Baltimore Liquor Stores Linked More to Violent Crime Than Bars and Restaurants, published by the Johns Hopkins Bloomberg School of Public Health 5. proximity to liquor stores
Understanding the Criminogenic Properties of Vacant Housing: A Mixed Methods Approach by Porter et al. (this is a UMD paper!) 6. proximity to vacant buildings
Exploring the relationship between strip clubs and rates of sexual violence and violent crime 7. proximity to strip clubs
Let's import the core libraries that we will be using in our project.
import os
from os import path
import requests
import pandas as pd
import plotly.express as px
import geopandas as gpd
from shapely.geometry import Point
from bs4 import BeautifulSoup
import re
from time import sleep
data_files = [
'crime-data',
'crime-codes',
'police-stations',
'population-density',
'median-income',
'liquor-licenses',
'vacant-buildings',
'baltimore-csa'
]
We only want to download the data once, so that future re-runs do not result in downloading it again. To accomplish this, we will create a version control system for our data. The benefit of this is that when we change our computation logic, we will not have to redownload any data. First, let's put data corresponding to each tag in its own subdirectory.
def data_directory(tag) -> str:
    directory = f'data/{tag}'
    # create the directory (and any missing parents) if it does not already exist
    os.makedirs(directory, exist_ok=True)
    return directory
Next, this function will, given the "tag" of the topic and a version,
return a filepath for the data. For example,
data_filepath('crime-data', 0) would return data/crime-data/v0.csv.
def data_filepath(tag, version=0, extension=None) -> str:
    if tag not in data_files:
        raise Exception(f'Invalid tag: {tag}')
    directory = data_directory(tag)
    if extension is None:
        # then, autodetect the extension from an existing file for this version
        available_files = os.listdir(directory)
        found_file = next(
            (filename for filename in available_files if filename.startswith(f'v{version}')),
            None,
        )
        if found_file is None:
            raise FileNotFoundError(f'No v{version} file found for tag: {tag}')
        _, extension = path.splitext(found_file)
    else:
        extension = f'.{extension}'
    return f'{directory}/v{version}{extension}'
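The extension autodetection above can be exercised in isolation. Here is a minimal sketch using a temporary directory and a made-up tag name:

```python
import os
import tempfile
from os import path

# set up a fake data directory containing one versioned file
tmp = tempfile.mkdtemp()
directory = path.join(tmp, 'example-tag')  # hypothetical tag
os.mkdir(directory)
open(path.join(directory, 'v0.csv'), 'w').close()

# replicate the autodetection: find the file for version 0, then split off its extension
version = 0
available_files = os.listdir(directory)
found_file = next(f for f in available_files if f.startswith(f'v{version}'))
_, extension = path.splitext(found_file)

print(path.join(directory, f'v{version}{extension}'))  # e.g. <tmpdir>/example-tag/v0.csv
```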
from typing import Iterable, Generator
def delay_iterable(iterable: Iterable, delay_seconds=0.5) -> Generator:
is_first_element = True
for element in iterable:
if not is_first_element:
sleep(delay_seconds)
else:
is_first_element = False
yield element
Let's write another function for simply delaying repeated function calls.
def delay_fn(f, delay_seconds=0.5):
    is_first_time = True
    def delayed_fn(*args, **kwargs):
        # f is only read inside the closure, so only is_first_time needs to be nonlocal
        nonlocal is_first_time
        if not is_first_time:
            sleep(delay_seconds)
        else:
            is_first_time = False
        return f(*args, **kwargs)
    return delayed_fn
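As a quick sanity check on delay_fn, the first call should return immediately while the second waits roughly delay_seconds. A short delay keeps the check fast:

```python
from time import sleep, monotonic

def delay_fn(f, delay_seconds=0.5):
    is_first_time = True
    def delayed_fn(*args, **kwargs):
        nonlocal is_first_time
        if not is_first_time:
            sleep(delay_seconds)
        else:
            is_first_time = False
        return f(*args, **kwargs)
    return delayed_fn

delayed_double = delay_fn(lambda x: x * 2, delay_seconds=0.05)

start = monotonic()
first = delayed_double(1)        # no delay on the first call
first_elapsed = monotonic() - start

start = monotonic()
second = delayed_double(2)       # sleeps ~0.05s before running
second_elapsed = monotonic() - start
```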
We don't want to refetch our data every time we run the notebook, so let's write some helper functions that allow us to cache our data after we download it.
from os import path
def download_cached(url: str, filepath: str):
# if the file already exists, does nothing.
# otherwise, downloads to filepath
if not path.exists(filepath):
with open(filepath, 'w+') as f:
f.write(requests.get(url).text)
This function will normalize a Series between 0 and 1.
def normalize_min_max(series: 'Series') -> 'Series':
return (series - series.min()) / (series.max() - series.min())
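For example, a series of [10, 20, 30] maps to [0.0, 0.5, 1.0]:

```python
import pandas as pd

def normalize_min_max(series: pd.Series) -> pd.Series:
    # shift so the minimum becomes 0, then scale so the maximum becomes 1
    return (series - series.min()) / (series.max() - series.min())

normalized = normalize_min_max(pd.Series([10, 20, 30]))
print(normalized.tolist())  # [0.0, 0.5, 1.0]
```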
Let's write a generalized loading function which will determine how to read a file based on the extension. By default, it will get the freshest version of the data by finding the highest version of the data in its data directory. This will allow us to forget the details of how we collected the data and instead focus on the data itself.
import os
from os import path
def latest_data_filepath(key) -> str:
data_dir = data_directory(key)
versions = os.listdir(data_dir)
# return the lexicographically highest one
return f'{data_dir}/{max(versions)}'
def generic_load(key, version=None) -> object:
"""
If `version` is specified, gets that specific version of the data.
"""
file_path = data_filepath(key, version) if version is not None else latest_data_filepath(key)
    # dispatch on the file extension to pick the appropriate reader
_, extension = path.splitext(file_path)
if extension == '.csv':
return pd.read_csv(file_path)
elif extension == '.json':
return pd.read_json(file_path)
elif extension == '.geoparquet':
return gpd.read_parquet(file_path)
elif extension == '.geojson':
return gpd.read_file(file_path)
elif extension == '.parquet':
return pd.read_parquet(file_path)
else:
raise Exception(f'Invalid format: {extension}')
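One caveat worth noting about latest_data_filepath: max() on filenames compares lexicographically, so 'v10.csv' would sort below 'v2.csv'. With fewer than ten versions this is harmless, but a numeric sort key is more robust. A sketch of the difference:

```python
import re

filenames = ['v0.csv', 'v2.csv', 'v10.csv']

# lexicographic max picks the wrong file once versions reach double digits
assert max(filenames) == 'v2.csv'

# sorting on the numeric part of the version finds the true latest file
def version_number(filename: str) -> int:
    match = re.match(r'v(\d+)', filename)
    return int(match.group(1))

latest = max(filenames, key=version_number)
print(latest)  # v10.csv
```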
Next, let's write a generalized exporting function which will allow us
to export our dataframes to the filesystem. For GeoDataFrame types, we
specifically use the parquet format, since it is significantly
more space-efficient than geojson. We also prefer parquet over
csv for regular DataFrames because when reading from csv, some
"special" fields like datetimes are not read back into their proper types.
def generic_export(key, data: 'DataFrame', version):
    extension = 'geoparquet' if isinstance(data, gpd.GeoDataFrame) else 'parquet'
    file_path = data_filepath(key, version=version, extension=extension)
    data.to_parquet(file_path)
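The datetime round-trip problem mentioned above is easy to demonstrate: writing a datetime column to CSV and reading it back yields plain strings unless you re-parse explicitly. A minimal illustration using an in-memory buffer:

```python
import io
import pandas as pd

df = pd.DataFrame({'when': pd.to_datetime(['2016-06-18 00:33:00'])})
assert str(df['when'].dtype) == 'datetime64[ns]'

# round-trip through CSV: the dtype degrades to object (plain strings)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded['when'].dtype)  # object
```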
We will need to use the Google Maps Place Search API at several points in our Data Collection for geocoding and categorizing addresses. Let's write a function which will yield the URL and get parameters to make a request to the Places API, given a search query.
find_place_from_text_endpoint = r'https://maps.googleapis.com/maps/api/place/findplacefromtext/json'
def google_find_place_from_text_params(search_query):
get_params = {
'input': search_query,
'inputtype': 'textquery',
'fields': 'type,geometry,name,place_id',
'key': os.environ['GOOGLE_API_KEY']
}
# url, kwargs
return (find_place_from_text_endpoint, get_params)
This is a function which will extract a geometry component out of the
response to a find place from text request.
from shapely.geometry import Point
def extract_geometry_from_places_request(response):
    try:
        location = response['candidates'][0]['geometry']['location']
        return Point(location['lng'], location['lat'])
    except (KeyError, IndexError):
        # no candidates were returned, or the response had an unexpected shape
        return None
Now, let's write a function which will run a list of requests in parallel.
# unlike requests, this package allows us to make requests in parallel
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
def run_requests(request_param_list: 'List[Tuple[str, dict]]'):
session = FuturesSession(max_workers=4)
futures = []
for (index, (url, get_params)) in enumerate(request_param_list):
future = session.get(url, params=get_params)
future.index = index
futures.append(future)
resolved_futures = list(as_completed(futures))
# sort by index
resolved_futures.sort(key=lambda f: f.index)
# return the results
return list(map(lambda f: f.result().json(), resolved_futures))
Let's also define a helper for mapping a function over the rows of a DataFrame.
def map_df(function, df):
return (function(row) for row in df.to_dict(orient='records'))
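A quick illustration of map_df on a toy DataFrame:

```python
import pandas as pd

def map_df(function, df):
    # to_dict(orient='records') yields one plain dict per row
    return (function(row) for row in df.to_dict(orient='records'))

toy_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sums = list(map_df(lambda row: row['a'] + row['b'], toy_df))
print(sums)  # [4, 6]
```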
Let's get the bounding boxes of Baltimore's "Community Statistical Areas", which are "clusters of neighborhoods developed by the City's Planning Department based on recognizable city neighborhoods" (Source).
By using download_cached, we will only download the file if it does
not already exist.
download_url = r'https://opendata.arcgis.com/api/v3/datasets/9c96ae20e6cc41258015c2fd288716c4_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
download_filepath = data_filepath('baltimore-csa', version=0, extension='geojson')
download_cached(download_url, download_filepath)
Let's load the data and see what columns we have to work with.
baltimore_csa_df = generic_load('baltimore-csa')
print(baltimore_csa_df.columns)
Index(['FID', 'Community', 'Neigh', 'Tracts', 'Link', 'geometry'], dtype='object')
Now, let's plot our data, highlighting the different communities.
baltimore_csa_df.plot(column='Community')
<Axes: >
In this section, we are going to use the Baltimore Crime Data dataset, published by the city of Baltimore itself. Let's start off by downloading the data.
download_url = r'https://download.data.world/file_download/baltimore/baltimore-crime-data/BPD_Part_1_Victim_Based_Crime_Data.csv'
download_filepath = data_filepath('crime-data', version=0, extension='csv')
download_cached(download_url, download_filepath)
Now, let's look at the data and see how we can clean it up.
crime_df = pd.read_csv(download_filepath)
print(crime_df.head())
CrimeDate CrimeTime CrimeCode Location Description Weapon
0 06/18/2016 00:33:00 4E 2700 CHESLEY AVE I HANDS \
1 06/18/2016 00:39:00 4B 2700 FAIT AVE O KNIFE
2 06/18/2016 0015 9S 2400 CYLBURN AV Outside FIREARM
3 06/18/2016 01:53:00 3AF 2300 ORLEANS ST O FIREARM
4 06/18/2016 02:05:00 6C 800 N WOLFE ST I NaN
Post District Neighborhood Location 1
0 424.0 NORTHEASTERN North Harford Road (39.3679000000, -76.5555900000) \
1 232.0 SOUTHEASTERN Canton (39.2831500000, -76.5783400000)
2 532.0 NORTHERN Levindale (39.3510400000, -76.6597600000)
3 221.0 SOUTHEASTERN McElderry Park (39.2955600000, -76.5844600000)
4 321.0 EASTERN Middle East (39.3002700000, -76.5909700000)
Total Incidents
0 1
1 1
2 1
3 1
4 1
First, let's convert CrimeDate and CrimeTime, both currently
strings, into a single CrimeDateTime column. Looking at the data, we
have certain CrimeTime entries which seem to have the format
<hour><minute> instead of the regular <hour>:<minute>:<seconds>
format. Before we convert, we have to first normalize the outliers.
bad_time_format = re.compile(r'^(\d{2})(\d{2})$')
crime_df['CrimeTime'] = crime_df['CrimeTime'].str.replace(bad_time_format, r'\1:\2:00', regex=True)
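The substitution itself is plain re.sub; on a small sample of times, the malformed entry is rewritten while a well-formed one passes through untouched:

```python
import re

bad_time_format = re.compile(r'^(\d{2})(\d{2})$')

# '0015' matches the HHMM pattern and is rewritten; '00:33:00' does not match
fixed = [bad_time_format.sub(r'\1:\2:00', t) for t in ['0015', '00:33:00']]
print(fixed)  # ['00:15:00', '00:33:00']
```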
Let's make sure that all of the rows match our expected regex by printing out all of the times that don't match what we're expecting.
expected_time_format = re.compile(r'^\d{2}:\d{2}:\d{2}$')
print(crime_df[~crime_df['CrimeTime'].str.fullmatch(expected_time_format)]['CrimeTime'])
Series([], Name: CrimeTime, dtype: object)
The list is empty, so we're good. Now, let's do the DateTime
conversion.
crime_time = pd.to_timedelta(crime_df['CrimeTime'])
crime_df['CrimeDateTime'] = pd.to_datetime(crime_df['CrimeDate'], format='%m/%d/%Y') + crime_time
print(crime_df.head())
CrimeDate CrimeTime CrimeCode Location Description Weapon
0 06/18/2016 00:33:00 4E 2700 CHESLEY AVE I HANDS \
1 06/18/2016 00:39:00 4B 2700 FAIT AVE O KNIFE
2 06/18/2016 00:15:00 9S 2400 CYLBURN AV Outside FIREARM
3 06/18/2016 01:53:00 3AF 2300 ORLEANS ST O FIREARM
4 06/18/2016 02:05:00 6C 800 N WOLFE ST I NaN
Post District Neighborhood Location 1
0 424.0 NORTHEASTERN North Harford Road (39.3679000000, -76.5555900000) \
1 232.0 SOUTHEASTERN Canton (39.2831500000, -76.5783400000)
2 532.0 NORTHERN Levindale (39.3510400000, -76.6597600000)
3 221.0 SOUTHEASTERN McElderry Park (39.2955600000, -76.5844600000)
4 321.0 EASTERN Middle East (39.3002700000, -76.5909700000)
Total Incidents CrimeDateTime
0 1 2016-06-18 00:33:00
1 1 2016-06-18 00:39:00
2 1 2016-06-18 00:15:00
3 1 2016-06-18 01:53:00
4 1 2016-06-18 02:05:00
Let's drop any rows that do not have a Location 1 value, since location is our
most important feature.
crime_df = crime_df.drop(crime_df[crime_df['Location 1'].isnull()].index)
Let's also convert our Location 1 column into shapely Point objects.
Note that because geopandas expects x to correspond to longitude
and y to latitude, we have to reverse the original (latitude, longitude) order.
geometry = crime_df['Location 1'].str.extract(r'\((-?\d+\.\d+), (-?\d+\.\d+)\)', expand=True)
crime_df['geometry'] = gpd.points_from_xy(geometry[1].astype('float64'), geometry[0].astype('float64'))
crime_df = gpd.GeoDataFrame(crime_df)
Now, let's look at a scatter plot of our data.
crime_df.plot()
<Axes: >
It seems that there is a cluster of points which lie far away from the rest of the points. Let's see what's going on with them.
outliers = crime_df[crime_df['geometry'].y > 41]
print(outliers)
CrimeDate CrimeTime CrimeCode Location Description
48 06/18/2016 14:00:00 6E 2800 ELLICOTT DY O \
579 06/13/2016 04:00:00 5D 800 BOYD ST I
726 06/12/2016 13:00:00 6E 4100 HAYWARD AVE O
768 06/12/2016 20:39:00 4E AV & GLENFALLS AV O
997 06/10/2016 14:35:00 3AF ST & TOONE ST O
... ... ... ... ... ...
22353 12/14/2015 11:30:00 5A 3300 LIBERTY HEIGHTS AVE I
22541 12/13/2015 23:20:00 4E 3200 O''DONNELL ST I
22725 12/11/2015 01:01:00 7A 0 E OSTEND ST O
26154 11/15/2015 10:20:00 4A 300 E MADISON ST I
32009 10/03/2015 11:23:00 6C 4900 FREDERICK AVE I
Weapon Post District Neighborhood
48 NaN 814.0 SOUTHWESTERN Winchester \
579 NaN 931.0 SOUTHERN Hollins Market
726 NaN 634.0 NORTHWESTERN Woodmere
768 HANDS 425.0 NORTHEASTERN Glenham-Belhar
997 FIREARM 233.0 SOUTHEASTERN Medford
... ... ... ... ...
22353 NaN 642.0 NORTHWESTERN Forest Park
22541 HANDS 232.0 SOUTHEASTERN Canton
22725 NaN 942.0 SOUTHERN Federal Hill
26154 FIREARM 312.0 EASTERN Penn-Fallsway
32009 NaN 833.0 SOUTHWESTERN Beechfield
Location 1 Total Incidents CrimeDateTime
48 (41.5286100000, -76.6537300000) 1 2016-06-18 14:00:00 \
579 (41.5553400000, -76.6178600000) 1 2016-06-13 04:00:00
726 (41.5102000000, -76.6784100000) 1 2016-06-12 13:00:00
768 (41.6228300000, -76.5271200000) 1 2016-06-12 20:39:00
997 (41.6213600000, -76.5291100000) 1 2016-06-10 14:35:00
... ... ... ...
22353 (41.5228600000, -76.6614400000) 1 2015-12-14 11:30:00
22541 (41.6016700000, -76.5556000000) 1 2015-12-13 23:20:00
22725 (41.5677600000, -76.6011900000) 1 2015-12-11 01:01:00
26154 (41.5707600000, -76.5971500000) 1 2015-11-15 10:20:00
32009 (41.5043600000, -76.6862300000) 1 2015-10-03 11:23:00
geometry
48 POINT (-76.65373 41.52861)
579 POINT (-76.61786 41.55534)
726 POINT (-76.67841 41.51020)
768 POINT (-76.52712 41.62283)
997 POINT (-76.52911 41.62136)
... ...
22353 POINT (-76.66144 41.52286)
22541 POINT (-76.55560 41.60167)
22725 POINT (-76.60119 41.56776)
26154 POINT (-76.59715 41.57076)
32009 POINT (-76.68623 41.50436)
[116 rows x 13 columns]
After checking a few of the addresses, it seems that the geometry
field was entered incorrectly, but the addresses themselves are valid. To fix the
geometry, we'll geocode those addresses with the Google Places API.
def find_place_with_address(row):
return google_find_place_from_text_params(f"{row['Location']}, Baltimore")
geocoded_results = run_requests(map_df(find_place_with_address, outliers))
print(geocoded_results[0])
{'candidates': [{'geometry': {'location': {'lat': 39.2879496, 'lng': -76.6131106}, 'viewport': {'northeast': {'lat': 39.28929942989272, 'lng': -76.61176077010728}, 'southwest': {'lat': 39.28659977010728, 'lng': -76.61446042989272}}}, 'name': 'Ellicott St', 'place_id': 'ChIJxce0HJ4EyIkRIUuN8vhOQWQ', 'types': ['route']}], 'status': 'OK'}
Now, let's merge our results back into the parent DataFrame.
crime_df.loc[outliers.index, 'geometry'] = list(map(extract_geometry_from_places_request, geocoded_results))
Let's see if any of the geolocations failed.
failed_geolocations = crime_df[crime_df['geometry'].isnull()]
print(len(failed_geolocations))
2
We will drop these rows, since they likely correspond to misinputs.
crime_df = crime_df.drop(failed_geolocations.index)
Now, let's plot our data again and make sure it looks correct.
crime_df.plot()
<Axes: >
This looks a lot more like Baltimore!
Finally, let's drop the initial date and time columns as well as the
Location 1 column, and write out a new version of the data. As a short aside,
this dataset takes almost 100 MB as a geojson file, but only about 7 MB
in parquet format.
crime_df = crime_df.drop(['CrimeDate', 'CrimeTime', 'Location 1'], axis=1)
generic_export('crime-data', crime_df, version=1)
CrimeCode seems important, but in its current form it's not very
useful. To find out what each crime code means, let's download the companion
Crime Codes dataset.
download_url = r'https://www.arcgis.com/sharing/rest/content/items/e6ca4eadecdc475a961f68bc314e2a86/data'
download_filepath = data_filepath('crime-codes', version=0, extension='csv')
download_cached(download_url, download_filepath)
Let's see what the data looks like before moving on.
crime_codes_df = pd.read_csv(download_filepath)
print(crime_codes_df.head())
  CODE  TYPE            NAME   CLASS    NAME_COMBINE   WEAPON
0   13  CTYP  ASSIST OFFICER     CFS  ASSIST OFFICER      NaN \
1   1A  CTYP          MURDER  PART 1        HOMICIDE    OTHER
2   1F  CTYP          MURDER  PART 1        HOMICIDE  FIREARM
3   1K  CTYP          MURDER  PART 1        HOMICIDE    KNIFE
4   1O  CTYP          MURDER  PART 1        HOMICIDE    OTHER
  VIOLENT_CR VIO_PROP_CFS
0        NaN        OTHER
1   HOMICIDE      VIOLENT
2   HOMICIDE      VIOLENT
3   HOMICIDE      VIOLENT
4   HOMICIDE      VIOLENT
There are 9 districts in Baltimore, corresponding to the 4 cardinal
directions, 4 in-betweens, and one central district. To get the locations
of the police stations in Baltimore, we will webscrape
https://www.baltimorepolice.org/find-my-district, get the address of
each station, and then geocode each address into a latitude and
longitude.
First, let's set a constant for our base URL, and abstract out our directions into lists.
base_url = 'https://www.baltimorepolice.org/find-my-district'
vertical_directions = ['north', 'south']
horizontal_directions = ['east', 'west']
Let's start by setting our central station.
stations = ['central']
Now, let's add in each compass direction, appending an "ern" to the end of each one, i.e "east" becomes "eastern".
for direction in vertical_directions + horizontal_directions:
stations.append(f'{direction}ern')
Next, we'll add the compound directions, which are formed by joining a vertical and horizontal direction, followed by "ern" like before.
for vertical in vertical_directions:
for horizontal in horizontal_directions:
stations.append(f'{vertical}{horizontal}ern')
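Putting the three steps together, the full station list can be built and checked in one place:

```python
vertical_directions = ['north', 'south']
horizontal_directions = ['east', 'west']

# the central station, then the four simple directions, then the compounds
stations = ['central']
for direction in vertical_directions + horizontal_directions:
    stations.append(f'{direction}ern')
for vertical in vertical_directions:
    for horizontal in horizontal_directions:
        stations.append(f'{vertical}{horizontal}ern')

print(stations)
# ['central', 'northern', 'southern', 'eastern', 'western',
#  'northeastern', 'northwestern', 'southeastern', 'southwestern']
```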
Now that we have a list of all of our stations, let's make a dictionary mapping each station to its address. First, let's write a function that will lookup the address of a single station.
address_pattern = re.compile(r'Address: (.+)')
def police_lookup_address(station: str) -> str:
r = requests.get(f'{base_url}/{station}-district')
soup = BeautifulSoup(r.text, 'html.parser')
combined_text = soup.get_text()
search_result = address_pattern.search(combined_text)
# return the first capture group
return search_result.group(1)
Now, let's make a DataFrame for our stations.
stations_df = pd.DataFrame.from_dict({'station': stations})
Let's add a row for the address of each station.
stations_df['address'] = stations_df.apply(delay_fn(lambda row: police_lookup_address(row.station)), axis=1)
print(stations_df)
        station                                          address
0       central       501 N. Calvert Street Baltimore, MD 21202
1      northern  2201 W. Cold Spring Lane, Baltimore, Md., 21215
2      southern         10 Cherry Hill Road, Baltimore, MD 21225
3       eastern         1620 Edison Highway, Baltimore, MD 21213
4       western        1034 N. Mount Street, Baltimore, MD 21217
5  northeastern          1900 Argonne Drive, Baltimore, MD 21218
6  northwestern      5271 Reisterstown Road, Baltimore, MD 21215
7  southeastern         5710 Eastern Avenue, Baltimore, MD 21224
8  southwestern        424 Font Hill Avenue, Baltimore, MD 21223
Next, let's use geopandas to convert each one of those addresses into
a latitude and longitude.
stations_geocoded = gpd.tools.geocode(stations_df.address)
print(stations_geocoded)
geometry
0 POINT (-76.61218 39.29542) \
1 POINT (-76.57794 39.34498)
2 POINT (-76.61718 39.25287)
3 POINT (-76.57335 39.31002)
4 POINT (-76.64474 39.30067)
5 POINT (-76.58289 39.34082)
6 POINT (-76.68513 39.34467)
7 POINT (-76.58799 39.21552)
8 POINT (-76.65676 39.28031)
address
0 Central District Baltimore Police Department, ...
1 2201, East Cold Spring Lane, 21214, East Cold ...
2 Southern District, 10, Cherry Hill Road, 21225...
3 Baltimore Police Department Eastern District, ...
4 Western District Police Station, 1034, North M...
5 Baltimore Police Department Northeast District...
6 Northwest Police Station, 5271, Reisterstown R...
7 5710, Pennington Avenue, 21226, Pennington Ave...
8 424, Millington Avenue, 21223, Millington Aven...
We don't need the station column anymore, and the geocoded address
is superior (more detailed) to the original, so we will replace the
initial dataframe with the new one entirely.
stations_df = stations_geocoded
Finally, let's export our stations data to its versioned file.
generic_export('police-stations', stations_df, 1)
For each district, we will have a coefficient representing population density.
First, let's get the Total Population dataset from Open Baltimore.
download_url = r'https://opendata.arcgis.com/api/v3/datasets/56d5b4e5480049e98315c2732aa48437_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
download_filepath = data_filepath('population-density', version=0)
download_cached(download_url, download_filepath)
Now, let's read it into a DataFrame.
populations_df = generic_load('population-density', version=0)
print(populations_df.head())
   OBJECTID                            CSA2010  tpop10  tpop20   Shape__Area
0         1      Allendale/Irvington/S. Hilton   16217   14828  6.377046e+07 \
1         2    Beechfield/Ten Hills/West Hills   12264   12083  4.788253e+07
2         3                      Belair-Edison   17416   15085  4.495003e+07
3         4  Brooklyn/Curtis Bay/Hawkins Point   14243   13479  1.760777e+08
4         5                             Canton    8100    8239  1.540854e+07
   Shape__Length                                           geometry
0   38770.165571  POLYGON ((-76.65726 39.27600, -76.65726 39.276...
1   37524.950533  POLYGON ((-76.69479 39.30201, -76.69465 39.301...
2   31307.314843  POLYGON ((-76.56761 39.32636, -76.56746 39.326...
3  150987.703639  MULTIPOLYGON (((-76.58867 39.21283, -76.58824 ...
4   23338.611948  POLYGON ((-76.57140 39.28441, -76.57138 39.284...
We will combine the 2010 and 2020 population counts and divide by area to get a density value. (The code sums the two years rather than averaging them; since we normalize afterwards, the constant factor of two makes no difference.)
populations_df = populations_df.assign(density=lambda df: (df['tpop10'] + df['tpop20']) / df['Shape__Area'])
Let's normalize the density values between 0 and 1 using min-max normalization, since the actual values themselves are less important than how they relate to one another.
populations_df['density'] = normalize_min_max(populations_df['density'])
Let's now drop our unneeded tpop columns.
populations_df = populations_df.drop(['tpop10', 'tpop20'], axis=1)
Finally, let's export our dataframe.
generic_export('population-density', populations_df, version=1)
For Median Household Income, we will use the Median Household Income dataset from Open Baltimore.
download_url = r'https://opendata.arcgis.com/api/v3/datasets/8613366cfbc7447a9efd9123604c65c1_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
download_filepath = data_filepath('median-income', version=0)
download_cached(download_url, download_filepath)
Now, let's read it into a DataFrame.
median_income_df = generic_load('median-income', version=0)
print(median_income_df.head())
OBJECTID CSA2010 mhhi10 mhhi11
0 1 Allendale/Irvington/S. Hilton 33563.12 33504.324121 \
1 2 Beechfield/Ten Hills/West Hills 50780.92 50439.739513
2 3 Belair-Edison 42920.83 45149.372510
3 4 Brooklyn/Curtis Bay/Hawkins Point 32888.50 33644.052189
4 5 Canton 74685.14 82129.569175
mhhi12 mhhi13 mhhi14 mhhi15 mhhi16
0 33177.658915 38129.073308 35958.253351 36701.906742 37302.171053 \
1 50135.121622 49807.861765 52622.688525 51537.582075 53565.079698
2 46743.281847 43903.901338 38905.924198 38173.968254 40482.359649
3 33526.364650 34419.965251 35861.896552 36679.053435 38603.930233
4 84978.141631 90862.712924 91735.652736 95362.400309 103281.832192
mhhi17 mhhi18 mhhi19
0 39495.628472 38535.562176 43019.75792 \
1 57572.502747 58055.306613 55017.77971
2 39624.482085 42633.619512 46703.93468
3 40275.275330 39936.512500 39162.13858
4 111891.251825 116911.088235 128460.48210
CSA2020 mhhi20 mhhi21
0 Allendale/Irvington/S. Hilton 42714.60327 42714.60327 \
1 Beechfield/Ten Hills/West Hills 55435.50291 55435.50291
2 Belair-Edison 50010.43737 50010.43737
3 Brooklyn/Curtis Bay/Hawkins Point 32598.81788 32598.81788
4 Canton 134208.76110 134208.76110
Shape__Area Shape__Length
0 6.377046e+07 38770.165571 \
1 4.788253e+07 37524.950533
2 4.495003e+07 31307.314843
3 1.760777e+08 150987.703639
4 1.540854e+07 23338.611948
geometry
0 POLYGON ((-76.65726 39.27600, -76.65726 39.276...
1 POLYGON ((-76.69479 39.30201, -76.69465 39.301...
2 POLYGON ((-76.56761 39.32636, -76.56746 39.326...
3 MULTIPOLYGON (((-76.58867 39.21283, -76.58824 ...
4 POLYGON ((-76.57140 39.28441, -76.57138 39.284...
It gives us median household income from 2010 through 2021. For consistency, we will use the middle year, 2016, as our standard.
median_income_df = median_income_df[['CSA2010', 'mhhi16', 'Shape__Area', 'Shape__Length', 'geometry']].rename(columns={'mhhi16': 'MedianIncome'})
Let's now normalize our MedianIncome column.
median_income_df['MedianIncome'] = normalize_min_max(median_income_df['MedianIncome'])
Finally, let's export the result.
generic_export('median-income', median_income_df, version=1)
To get the list of liquor stores, bars, and strip clubs, we will use the Liquor Licenses dataset from Open Baltimore. Note that we are operating on the assumption that all strip clubs have liquor licenses, but logically this seems like a fair assumption. Unfortunately, there is something wrong with the data export on the server side, so rather than simply downloading the data and filtering it locally, we need to use the SQL query interface to get the data.
query_endpoint = r'https://opendata.baltimorecity.gov/egis/rest/services/NonSpatialTables/Licenses/FeatureServer/0/query'
There are a couple things that will go into our query.
We want to see establishments that are currently open over the course of our crime data. Let's find the latest date for our crime data.
# reload the data from the filesystem in case we haven't run the previous cells
crime_df = generic_load('crime-data')
print(crime_df.CrimeDateTime.max())
Thus, we will filter for license end dates only after 06/18/2016,
since this is when our data ends.
We only want places that are liquor stores, bars, or strip clubs; not
restaurants or anything else. After querying the endpoint, I found
the possible EstablishmentDesc categories, which label what kind of
place each license is for.
Of these options, we are interested in Tavern, which represents
bars; Package goods only, which represents liquor stores; and
Adult, which is code for strip clubs. Surprisingly,
Tavern License primarily belongs to restaurants, so we do not
group it in with Tavern.
As described in the documentation, there is a limit on how many data
entries we can return at once. However, if we enable the returnIdsOnly
parameter, this limit does not apply, and we can get all of the ids at
once.
query_params = {
'where': f"LicenseEndDate > DATE '{crime_df.CrimeDateTime.max()}' AND EstablishmentDesc IN ('Tavern', 'Adult', 'Package goods only')",
'returnIdsOnly': 'true',
'f': 'geojson'
}
Unfortunately, making the request yields the error described in the Stack Overflow post "SSL error unsafe legacy renegotiation disabled", caused by a bad SSL setup on the server side. This is out of our control, and also out of the scope of this project, so I'm adapting one of the answers from the post in order to make the request. Please pretend this code doesn't exist.
import urllib3
import ssl
class CustomHttpAdapter (requests.adapters.HTTPAdapter):
# "Transport adapter" that allows us to use custom ssl_context.
def __init__(self, ssl_context=None, **kwargs):
self.ssl_context = ssl_context
super().__init__(**kwargs)
def init_poolmanager(self, connections, maxsize, block=False):
self.poolmanager = urllib3.poolmanager.PoolManager(
num_pools=connections, maxsize=maxsize,
block=block, ssl_context=self.ssl_context)
def get_legacy_session():
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
ctx.options |= 0x4 # OP_LEGACY_SERVER_CONNECT
session = requests.session()
session.mount('https://', CustomHttpAdapter(ctx))
return session
def legacy_request_get(*args, **kwargs):
return get_legacy_session().get(*args, **kwargs)
Now that that's out of the way, let's make our request to get the list of IDs.
ids = legacy_request_get(query_endpoint, params=query_params).json()['properties']['objectIds']
print(ids[:5])
[1, 2, 3, 4, 5]
Because there's a limit on getting the data (apart from the IDs), we
need to paginate our requests so that we get a certain number of records
at a time. From my testing, requesting 250 entries at once works
reliably, so we will use that number going forward.
Let's disable where and returnIdsOnly since we now want the records
from the IDs.
entries_query_params = query_params.copy()
entries_query_params['where'] = None
entries_query_params['returnIdsOnly'] = 'false'
entries_query_params['outFields'] = ','.join(['TradeName', 'EstablishmentDesc', 'AddrStreet', 'AddrZip'])
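Before running it against the live endpoint, the chunking arithmetic the scraping loop relies on can be sketched with a made-up ID list and chunk size:

```python
# hypothetical stand-ins for the real object-ID list and page size
ids = list(range(1, 12))       # 11 fake IDs
entries_per_request = 5

# walk the list in fixed-size strides, clamping the final chunk
chunks = []
for start_index in range(0, len(ids), entries_per_request):
    end_index = min(start_index + entries_per_request, len(ids))
    chunks.append(ids[start_index:end_index])

print(chunks)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11]]
```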
Let's run our scraping code.
liquor_df = None
entries_per_request = 250
for start_index in range(0, len(ids), entries_per_request):
end_index = min(start_index + entries_per_request, len(ids))
id_chunk = ids[start_index:end_index]
entries_query_params['objectIds'] = ','.join(map(str, id_chunk))
request_res = legacy_request_get(query_endpoint, params=entries_query_params).json()
chunk_df = gpd.GeoDataFrame.from_features(request_res)
if liquor_df is None:
liquor_df = chunk_df
else:
liquor_df = pd.concat([liquor_df, chunk_df])
sleep(0.5)
print(liquor_df)
geometry TradeName EstablishmentDesc
0 None DIVING HORSE GENTLEMAN CLUB Adult \
1 None DIVING HORSE GENTLEMAN CLUB Adult
2 None DIVING HORSE GENTLEMAN CLUB Adult
3 None RED ROOM Adult
4 None GODDESS Adult
.. ... ... ...
209 None TOM & OLIVE'S CAFE Tavern
210 None TOM & OLIVE'S CAFE Tavern
211 None THE HANOVER Tavern
212 None THE HANOVER Tavern
213 None THE HANOVER Tavern
AddrStreet AddrZip
0 5-11 COMMERCE STREET 21202
1 5-11 COMMERCE STREET 21202
2 5-11 COMMERCE STREET 21202
3 411 BALTIMORE STREET EAST 21202
4 36-38 EUTAW STREET SOUTH 21201
.. ... ...
209 2942 MONUMENT STREET EAST 21205
210 2942 MONUMENT STREET EAST 21205
211 3432-34 HANOVER STREET SOUTH 21225
212 3432-34 HANOVER STREET SOUTH 21225
213 3432-34 HANOVER STREET SOUTH 21225
[7714 rows x 5 columns]
Let's reset the index since it's currently numbered strangely.
liquor_df = liquor_df.reset_index(drop=True)
Now that we have that data, let's write it as version 0 of our dataset.
generic_export('liquor-licenses', liquor_df, version=0)
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:968: RuntimeWarning: All-NaN slice encountered
  np.nanmin(b[:, 0]),  # minx
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:969: RuntimeWarning: All-NaN slice encountered
  np.nanmin(b[:, 1]),  # miny
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:970: RuntimeWarning: All-NaN slice encountered
  np.nanmax(b[:, 2]),  # maxx
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:971: RuntimeWarning: All-NaN slice encountered
  np.nanmax(b[:, 3]),  # maxy
We have 7714 rows, but unfortunately the endpoint did not give us our
geometry. Therefore, we have to use geocoding once again.
From experimentation, it seems that geocoding fails for addresses that
contain dashes in the numeric element, like 5-11 Foobar Street.
relevant_regex = re.compile(r'^(\d+)-\d+')
print(liquor_df[liquor_df.AddrStreet.str.match(relevant_regex)])
geometry TradeName EstablishmentDesc
0 None DIVING HORSE GENTLEMAN CLUB Adult \
1 None DIVING HORSE GENTLEMAN CLUB Adult
2 None DIVING HORSE GENTLEMAN CLUB Adult
4 None GODDESS Adult
5 None WAGON WHEEL/PLAYERS CLUB Adult
... ... ... ...
7706 None YESTERYEAR'S LIQUORS Package goods only
7707 None YESTERYEAR'S LIQUORS Package goods only
7711 None THE HANOVER Tavern
7712 None THE HANOVER Tavern
7713 None THE HANOVER Tavern
AddrStreet AddrZip
0 5-11 COMMERCE STREET 21202
1 5-11 COMMERCE STREET 21202
2 5-11 COMMERCE STREET 21202
4 36-38 EUTAW STREET SOUTH 21201
5 821-23 NORTH POINT ROAD 21237
... ... ...
7706 212-214 HIGHLAND AVENUE NORTH 21224
7707 212-214 HIGHLAND AVENUE NORTH 21224
7711 3432-34 HANOVER STREET SOUTH 21225
7712 3432-34 HANOVER STREET SOUTH 21225
7713 3432-34 HANOVER STREET SOUTH 21225
[1948 rows x 5 columns]
To fix this, we'll remove the second dash component in all of these rows.
liquor_df['AddrStreet'] = liquor_df['AddrStreet'].replace(relevant_regex, r'\1')
print(liquor_df[liquor_df.AddrStreet.str.match(relevant_regex)])
Empty GeoDataFrame Columns: [geometry, TradeName, EstablishmentDesc, AddrStreet, AddrZip] Index: []
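As a quick sanity check, here is a minimal sketch of what that regex substitution does to a single address string (the addresses below are just illustrative examples):

```python
import re

# Same pattern as relevant_regex above: capture the first number of a
# dashed range like "5-11" and drop the dash and second number.
pattern = re.compile(r'^(\d+)-\d+')

print(pattern.sub(r'\1', '5-11 COMMERCE STREET'))
# 5 COMMERCE STREET

# Addresses without a dashed range are left untouched.
print(pattern.sub(r'\1', '411 BALTIMORE STREET EAST'))
# 411 BALTIMORE STREET EAST
```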
Next, let's filter out all of the duplicate entries, since for many of the establishments we have several entries corresponding to separate liquor license renewals. First, we'll delete all completely duplicate rows.
liquor_df = liquor_df.drop_duplicates()
print(liquor_df)
geometry TradeName EstablishmentDesc
0 None DIVING HORSE GENTLEMAN CLUB Adult \
3 None RED ROOM Adult
4 None GODDESS Adult
5 None WAGON WHEEL/PLAYERS CLUB Adult
8 None CIRCUS BAR Adult
... ... ... ...
7546 None SIX PAX "N" MORE Package goods only
7644 None ACE LIQUORS & GROCERIES Package goods only
7671 None KWIK MART Tavern
7673 None FIREFLY FARMS MARKET Package goods only
7700 None HAMPDEN YARDS Tavern
AddrStreet AddrZip
0 5 COMMERCE STREET 21202
3 411 BALTIMORE STREET EAST 21202
4 36 EUTAW STREET SOUTH 21201
5 821 NORTH POINT ROAD 21237
8 427 BALTIMORE STREET EAST 21202
... ... ...
7546 6622 BELAIR ROAD 21206
7644 500 MAUDE AVENUE 21225
7671 6656 BELAIR ROAD 21206
7673 3300 CLIPPER MILL ROAD, STALL #7 21201
7700 1014 36TH STREET WEST 21211
[858 rows x 5 columns]
As we can see, the number of rows decreased significantly.
All of the entries marked Adult are what we expected, so we won't
have to verify those. Unfortunately, some of the entries marked
Tavern or Package goods only are actually restaurants or grocery
stores, and some of the listed places do not even seem to sell alcohol,
so these definitely have to be filtered out. To classify the businesses,
we will have to use the Google Maps Place Search API
to categorize each of them as bar, liquor_store or neither.
In order to use the API, we need to have the name of the business.
stores_without_names = liquor_df[liquor_df['TradeName'] == 'N/A']
print(stores_without_names)
     geometry TradeName   EstablishmentDesc                 AddrStreet AddrZip
2329     None       N/A              Tavern        4017 EASTERN AVENUE   21224
3189     None       N/A              Tavern  1001 CHARLES STREET NORTH   21201
3412     None       N/A              Tavern         2322 BOSTON STREET   21224
3466     None       N/A              Tavern   4100 LOMBARD STREET EAST   21224
3480     None       N/A              Tavern  3432 HANOVER STREET SOUTH   21225
3502     None       N/A              Tavern         4600 CURTIS AVENUE   21226
3526     None       N/A              Tavern           5307 BELAIR ROAD   21206
3580     None       N/A               Adult   4100 LOMBARD STREET EAST   21224
4532     None       N/A              Tavern          1739 LIGHT STREET   21230
5024     None       N/A              Tavern   623 LUZERNE AVENUE SOUTH   21224
5267     None       N/A              Tavern          5517 HARFORD ROAD   21214
5366     None       N/A  Package goods only      249 CHASE STREET WEST   21201
5481     None       N/A              Tavern         709 BROADWAY SOUTH   21231
5797     None       N/A              Tavern      2903 O'DONNELL STREET   21224
6165     None       N/A              Tavern             5407 YORK ROAD   21212
6648     None       N/A              Tavern         2218 BOSTON STREET   21231
Let's drop these rows that don't have names, since they likely correspond to misinputs.
liquor_df = liquor_df[liquor_df['TradeName'] != 'N/A'].reset_index(drop=True)
Now, let's look up the details for the remaining rows. We'll write a
function which will construct a search query from a row in the format:
<business name> <zip code>, and plug that into our previous function.
In my experimentation, this form of query works with the highest
accuracy.
def find_place_with_tradename_zip_code(row):
search_query = f"{row['TradeName']} {row['AddrZip']}"
return google_find_place_from_text_params(search_query)
Finally, let's use our functions to plug all of our rows into the API.
from json import loads, dumps
results1 = run_requests(map_df(find_place_with_tradename_zip_code, liquor_df))
print(dumps(results1[:2], indent=2))
[
{
"candidates": [],
"status": "ZERO_RESULTS"
},
{
"candidates": [
{
"geometry": {
"location": {
"lat": 39.2895378,
"lng": -76.6097109
},
"viewport": {
"northeast": {
"lat": 39.29095042989272,
"lng": -76.60837092010728
},
"southwest": {
"lat": 39.28825077010728,
"lng": -76.61107057989273
}
}
},
"name": "The Red Room Cabaret",
"place_id": "ChIJre21q50EyIkRMAAFZPZjnXU",
"types": [
"point_of_interest",
"establishment"
]
}
],
"status": "OK"
}
]
The very first search resulted in ZERO_RESULTS. Let's see how many of
them there were.
zero_results = list(filter(lambda j: j['status'] == 'ZERO_RESULTS', results1))
print(len(zero_results))
57
These entries likely correspond to businesses that have been closed, so we will ignore them.
Next, let's annotate liquor_df with whether each row corresponds to a bar
or a liquor_store. First, let's add a column holding the raw
response from the Google Places API.
liquor_df['google_response'] = results1
Before we move on, let's save another version of our data.
generic_export('liquor-licenses', liquor_df, version=1)
Now, let's define some functions we can use to determine if a row
corresponds to a category. We want to automatically classify places that
have certain words in their name, to account for locations which the
Places API does not have record of. We will say that a row corresponds
to category X if the Places API says that it has the correct type,
or if the name contains one of the magic keywords.
def test_string_contains_any(string, expected_substrings) -> bool:
lowercase_string = string.lower()
return any(map(lambda sub: (sub.lower() in lowercase_string), expected_substrings))
def make_is_type_column(row, wanted_type, automatic_keywords):
    try:
        first_candidate = row['google_response']['candidates'][0]
        res = (wanted_type in first_candidate['types']) or test_string_contains_any(first_candidate['name'], automatic_keywords)
    except (KeyError, IndexError, TypeError):
        # No usable candidate (e.g. ZERO_RESULTS); rely on the name check below.
        res = False
    return res or test_string_contains_any(row['TradeName'], automatic_keywords)
We will say that a bar must either have the bar Places API type, OR
have bar in its name. Likewise, a liquor store must either have the
liquor_store type, or contain one of the keywords liquor, wine or
spirits in its name.
liquor_df['is_bar'] = liquor_df.apply(make_is_type_column, axis=1, args=('bar', ['bar'])).astype('bool')
liquor_df['is_liquor_store'] = liquor_df.apply(make_is_type_column, axis=1, args=('liquor_store', ['liquor', 'wine', 'spirits'])).astype('bool')
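To see how this classification behaves, here is a self-contained sketch of the same logic run on a hand-written, hypothetical Places API response (the row and response below are made up for illustration, and the two helpers mirror the functions defined above):

```python
# Hypothetical row with a made-up Places API response.
row = {
    'TradeName': "JOE'S WINE & SPIRITS",
    'google_response': {
        'status': 'OK',
        'candidates': [{'name': "Joe's Wine & Spirits",
                        'types': ['liquor_store', 'point_of_interest']}],
    },
}

def contains_any(string, subs):
    # Case-insensitive substring check, like test_string_contains_any.
    s = string.lower()
    return any(sub.lower() in s for sub in subs)

def is_type(row, wanted_type, keywords):
    # Mirrors make_is_type_column: API type match OR keyword in name.
    try:
        cand = row['google_response']['candidates'][0]
        res = wanted_type in cand['types'] or contains_any(cand['name'], keywords)
    except (KeyError, IndexError, TypeError):
        res = False
    return res or contains_any(row['TradeName'], keywords)

print(is_type(row, 'liquor_store', ['liquor', 'wine', 'spirits']))  # True
print(is_type(row, 'bar', ['bar']))  # False
```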
Now, let's see which Taverns don't actually correspond to bars.
is_fake_tavern = (liquor_df['EstablishmentDesc'] == 'Tavern') & (~liquor_df['is_bar'])
fake_taverns = liquor_df[is_fake_tavern]
print(fake_taverns)
geometry TradeName EstablishmentDesc
31 None CLUB LUZERNE Tavern \
35 None KNIGHT'S LIQUORS Tavern
46 None SHERRY'S PLACE Tavern
47 None YOUNG'S Tavern
50 None JEWEL BOX Tavern
.. ... ... ...
831 None PLAKA TAVERN Tavern
834 None RED DOOR Tavern
835 None E.A.T. (EGGROLLS AND TACOS) Tavern
836 None CAFE CAMPLI Tavern
839 None KWIK MART Tavern
AddrStreet AddrZip
31 542 LUZERNE AVENUE NORTH 21205 \
35 5139 PARK HEIGHTS AVENUE 21215
46 238 BROADWAY SOUTH 21231
47 2401 CHASE STREET EAST 21213
50 419 BALTIMORE STREET EAST 21202
.. ... ...
831 4714 EASTERN AVENUE 21224
834 625 GLOVER STREET NORTH 21205
835 1371 ANDRE STREET 21230
836 4801 HARFORD ROAD, SUITE S2 21214
839 6656 BELAIR ROAD 21206
google_response is_bar
31 {'candidates': [{'geometry': {'location': {'la... False \
35 {'candidates': [{'geometry': {'location': {'la... False
46 {'candidates': [], 'status': 'ZERO_RESULTS'} False
47 {'candidates': [], 'status': 'ZERO_RESULTS'} False
50 {'candidates': [{'geometry': {'location': {'la... False
.. ... ...
831 {'candidates': [], 'status': 'ZERO_RESULTS'} False
834 {'candidates': [{'geometry': {'location': {'la... False
835 {'candidates': [{'geometry': {'location': {'la... False
836 {'candidates': [{'geometry': {'location': {'la... False
839 {'candidates': [{'geometry': {'location': {'la... False
is_liquor_store
31 False
35 True
46 False
47 False
50 False
.. ...
831 False
834 True
835 False
836 False
839 False
[205 rows x 8 columns]
Next, fake liquor stores.
is_fake_liquor_store = (liquor_df['EstablishmentDesc'] == 'Package goods only') & (~liquor_df['is_liquor_store'])
fake_liquor_stores = liquor_df[is_fake_liquor_store]
print(fake_liquor_stores.head())
geometry TradeName EstablishmentDesc
24 None TRINACRIA FOODS Package goods only \
29 None KING'S KORNER Package goods only
30 None CVS/PHARMACY #4107 Package goods only
69 None JENIS 7 STAR FOOD MARKET Package goods only
71 None SECURED CREDITOR Package goods only
AddrStreet AddrZip
24 406 PACA STREET NORTH 21201 \
29 1713 FEDERAL STREET EAST 21213
30 6828 REISTERSTOWN ROAD 21215
69 4220 PENNINGTON AVENUE 21226
71 2300 OREM AVENUE 21217
google_response is_bar is_liquor_store
24 {'candidates': [{'geometry': {'location': {'la... False False
29 {'candidates': [{'geometry': {'location': {'la... False False
30 {'candidates': [{'geometry': {'location': {'la... False False
69 {'candidates': [{'geometry': {'location': {'la... False False
71 {'candidates': [{'geometry': {'location': {'la... False False
Next, let's drop any rows that don't match their expectations.
liquor_df = liquor_df.drop(liquor_df[is_fake_tavern | is_fake_liquor_store].index)
print(liquor_df)
geometry TradeName EstablishmentDesc
0 None DIVING HORSE GENTLEMAN CLUB Adult \
1 None RED ROOM Adult
2 None GODDESS Adult
3 None WAGON WHEEL/PLAYERS CLUB Adult
4 None CIRCUS BAR Adult
.. ... ... ...
832 None HAMILTON PARK LIQUORS Package goods only
833 None WONDERLAND LIQUORS Tavern
837 None SIX PAX "N" MORE Package goods only
838 None ACE LIQUORS & GROCERIES Package goods only
841 None HAMPDEN YARDS Tavern
AddrStreet AddrZip
0 5 COMMERCE STREET 21202 \
1 411 BALTIMORE STREET EAST 21202
2 36 EUTAW STREET SOUTH 21201
3 821 NORTH POINT ROAD 21237
4 427 BALTIMORE STREET EAST 21202
.. ... ...
832 2323 NORTHERN PARKWAY EAST 21214
833 2043 PENNSYLVANIA AVENUE 21217
837 6622 BELAIR ROAD 21206
838 500 MAUDE AVENUE 21225
841 1014 36TH STREET WEST 21211
google_response is_bar
0 {'candidates': [], 'status': 'ZERO_RESULTS'} False \
1 {'candidates': [{'geometry': {'location': {'la... True
2 {'candidates': [{'geometry': {'location': {'la... False
3 {'candidates': [{'geometry': {'location': {'la... False
4 {'candidates': [{'geometry': {'location': {'la... True
.. ... ...
832 {'candidates': [{'geometry': {'location': {'la... False
833 {'candidates': [{'geometry': {'location': {'la... True
837 {'candidates': [{'geometry': {'location': {'la... False
838 {'candidates': [{'geometry': {'location': {'la... False
841 {'candidates': [{'geometry': {'location': {'la... True
is_liquor_store
0 False
1 False
2 False
3 False
4 False
.. ...
832 True
833 True
837 True
838 True
841 False
[583 rows x 8 columns]
Now, let's use our extract_geometry_from_places_request helper to pull the coordinates out of each response.
liquor_df['geometry'] = liquor_df['google_response'].apply(extract_geometry_from_places_request)
Let's see how many rows don't have geometry now.
liquor_without_geometry = liquor_df[liquor_df['geometry'].isnull()].copy(deep=True)
print(liquor_without_geometry)
geometry TradeName EstablishmentDesc
0 None DIVING HORSE GENTLEMAN CLUB Adult \
108 None BACCHUS BAR & LIQUORS Tavern
123 None BULL ROBINSON LIQUORS Package goods only
184 None REGAL LIQUORS Package goods only
244 None SHUSTER'S LIQUOR Package goods only
402 None MADELINE DELI GROCERY BEER AND WINE Package goods only
502 None BAYVIEW LIQUORS SPORTS BAR& LOUNGE Tavern
588 None PERRY'S LIQUORS Package goods only
591 None IRVINGTON CUT RATE LIQUORS Package goods only
726 None GAYETY LIQUORS Package goods only
817 None SEA TOP GROCERY AND LIQUOR Package goods only
AddrStreet AddrZip
0 5 COMMERCE STREET 21202 \
108 1220 NORTH AVENUE WEST 21217
123 2736 GREENMOUNT AVENUE 21218
184 900 GILMOR STREET NORTH 21217
244 1231 BALTIMORE STREET WEST 21223
402 401 FURROW STREET SOUTH 21223
502 3804 EASTERN AVE 21224
588 2550 EDMONDSON AVENUE 21223
591 4100 FREDERICK AVENUE 21229
726 412 BALTIMORE STREET EAST 21202
817 4420 PENNINGTON AVENUE 21226
google_response is_bar is_liquor_store
0 {'candidates': [], 'status': 'ZERO_RESULTS'} False False
108 {'candidates': [], 'status': 'ZERO_RESULTS'} True True
123 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
184 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
244 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
402 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
502 {'candidates': [], 'status': 'ZERO_RESULTS'} True True
588 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
591 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
726 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
817 {'candidates': [], 'status': 'ZERO_RESULTS'} False True
For the remaining rows, let's geocode slightly differently, by plugging
the address into the Places API rather than the name of the business and
zip code. First, let's construct the Address column by combining
street and zip code.
liquor_without_geometry['Address'] = liquor_without_geometry['AddrStreet'] + ' ' + liquor_without_geometry['AddrZip']
Next, let's plug this column into the API to get our locations.
def get_premise(row):
search_query = row['Address']
return google_find_place_from_text_params(search_query)
results = run_requests(map_df(get_premise, liquor_without_geometry))
Now, let's overwrite our previous google_response column with our new
responses, and then use our extract_geometry function to extract the
location from the responses.
liquor_without_geometry['google_response'] = results
liquor_without_geometry['geometry'] = liquor_without_geometry['google_response'].apply(extract_geometry_from_places_request)
Let's see how many more addresses we got from this approach.
print(len(liquor_without_geometry[~liquor_without_geometry['geometry'].isnull()]))
10
Any remaining rows will be dropped, since establishments that Google cannot find at all have likely closed or are too obscure to meaningfully affect crime in their area.
liquor_without_geometry = liquor_without_geometry.drop(liquor_without_geometry[liquor_without_geometry['geometry'].isnull()].index)
With that done, let's update our original dataframe with our new data.
liquor_df.update(liquor_without_geometry)
Now, any remaining null rows will be dropped, as they likely do not exist anymore.
print(liquor_df[liquor_df['geometry'].isnull()])
liquor_df = liquor_df.drop(liquor_df[liquor_df['geometry'].isnull()].index)
geometry TradeName EstablishmentDesc AddrStreet
108 None BACCHUS BAR & LIQUORS Tavern 1220 NORTH AVENUE WEST \
AddrZip google_response is_bar
108 21217 {'candidates': [], 'status': 'ZERO_RESULTS'} True \
is_liquor_store
108 True
Finally, let's commit our data.
generic_export('liquor-licenses', liquor_df, version=2)
Next, let's add an is_strip_club column, derived directly from the
EstablishmentDesc column.
liquor_df['is_strip_club'] = liquor_df['EstablishmentDesc'] == 'Adult'
print(liquor_df[liquor_df['is_strip_club'] == True].head())
geometry TradeName EstablishmentDesc
0 POINT (-76.61007 39.28948) DIVING HORSE GENTLEMAN CLUB Adult \
1 POINT (-76.60971 39.28954) RED ROOM Adult
2 POINT (-76.62099 39.28765) GODDESS Adult
3 POINT (-76.53867 39.30348) WAGON WHEEL/PLAYERS CLUB Adult
4 POINT (-76.60913 39.28967) CIRCUS BAR Adult
AddrStreet AddrZip
0 5 COMMERCE STREET 21202 \
1 411 BALTIMORE STREET EAST 21202
2 36 EUTAW STREET SOUTH 21201
3 821 NORTH POINT ROAD 21237
4 427 BALTIMORE STREET EAST 21202
google_response is_bar is_liquor_store
0 {'candidates': [{'geometry': {'location': {'la... False False \
1 {'candidates': [{'geometry': {'location': {'la... True False
2 {'candidates': [{'geometry': {'location': {'la... False False
3 {'candidates': [{'geometry': {'location': {'la... False False
4 {'candidates': [{'geometry': {'location': {'la... True False
is_strip_club
0 True
1 True
2 True
3 True
4 True
Now, we can remove EstablishmentDesc, since it's been superseded by
the three is_<type> columns.
liquor_df = liquor_df.drop('EstablishmentDesc', axis=1)
Finally, let's commit the last version of the data.
generic_export('liquor-licenses', liquor_df, version=3)
To find the vacant buildings, we will use Open Baltimore's Vacant Building Notices dataset.
download_url = r'https://opendata.arcgis.com/api/v3/datasets/70de1e9137ce455aaee58f38b281d2cc_1/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
download_cached(download_url, data_filepath('vacant-buildings', version=0))
vacant_buildings_df = generic_load('vacant-buildings', version=0)
print(vacant_buildings_df)
OBJECTID NoticeNum DateNotice DateCancel DateAbate
0 723192 805231A 2012-01-25 15:36:59+00:00 NaN NaN \
1 723193 1780434A 2019-04-20 08:55:00+00:00 NaN NaN
2 723196 927919A 2013-02-01 14:19:00+00:00 NaN NaN
3 723197 2106499A 2022-05-13 16:20:00+00:00 NaN NaN
4 723202 2078873A 2022-02-10 16:01:00+00:00 NaN NaN
... ... ... ... ... ...
14203 960932 2018569A 2021-07-15 13:28:59+00:00 NaN NaN
14204 960939 1581740A 2017-09-03 14:36:59+00:00 NaN NaN
14205 960966 2177271A 2022-11-03 16:20:00+00:00 NaN NaN
14206 960978 1695135A 2018-07-14 16:20:00+00:00 NaN NaN
14207 962285 1826489A 2019-08-10 12:40:59+00:00 NaN NaN
NT OWNER_ABBR HousingMarketTypology2017 Council_District
0 Vacant NaN I 7 \
1 Vacant NaN I 7
2 Vacant NaN I 7
3 Vacant NaN I 7
4 Vacant NaN I 7
... ... ... ... ...
14203 Vacant NaN A 1
14204 Vacant NaN J 12
14205 Vacant NaN B 11
14206 Vacant NaN I 12
14207 Vacant NaN H 13
Neighborhood BLOCKLOT
0 EASTERWOOD 0001 003 \
1 EASTERWOOD 0001 004
2 EASTERWOOD 0001 007
3 EASTERWOOD 0001 008
4 EASTERWOOD 0001 013
... ... ...
14203 FELLS POINT 1847 015
14204 EAST BALTIMORE MIDWAY 4004 015
14205 DOWNTOWN 1351 016
14206 EAST BALTIMORE MIDWAY 4063 016
14207 MIDDLE EAST 1606 084
Address geometry
0 2041 W NORTH AVE POINT (-76.65104 39.30943)
1 2039 W NORTH AVE POINT (-76.65099 39.30943)
2 2033 W NORTH AVE POINT (-76.65084 39.30943)
3 2031 W NORTH AVE POINT (-76.65080 39.30943)
4 2021 W NORTH AVE POINT (-76.65055 39.30944)
... ... ...
14203 1935 ALICEANNA ST POINT (-76.58897 39.28342)
14204 1034 E NORTH AVE POINT (-76.60431 39.31196)
14205 23 S GAY ST POINT (-76.60845 39.28904)
14206 2403 GREENMOUNT AVE POINT (-76.60914 39.31685)
14207 825 N MADEIRA ST POINT (-76.58581 39.30052)
[14208 rows x 13 columns]
We are not concerned with buildings which were only declared vacant
after the end date of our crime data, so let's filter out those rows.
First, let's convert the timezone-aware DateNotice column to a naive
UTC datetime, without the timezone information. This will allow us to
compare it to any other naive datetimes that we have.
vacant_buildings_df['DateNotice'] = vacant_buildings_df['DateNotice'].dt.tz_convert(None)
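For reference, here is a tiny sketch of what tz_convert(None) does to a tz-aware series (using a synthetic timestamp, not the real data):

```python
import pandas as pd

# A timezone-aware series, like the raw DateNotice column.
aware = pd.Series(pd.to_datetime(['2012-01-25 15:36:59+00:00']))
print(aware.dtype)  # datetime64[ns, UTC]

# Dropping the timezone yields a naive UTC datetime that can be
# compared against other naive datetimes.
naive = aware.dt.tz_convert(None)
print(naive.dtype)  # datetime64[ns]
```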
Now, let's drop the rows we don't want.
crime_df = generic_load('crime-data')
last_crime_date = crime_df['CrimeDateTime'].max()
vacant_buildings_df = vacant_buildings_df.drop(vacant_buildings_df[vacant_buildings_df['DateNotice'] > last_crime_date].index)
vacant_buildings_df = vacant_buildings_df.reset_index(drop=True)
print(vacant_buildings_df['DateNotice'])
0 2012-01-25 15:36:59
1 2013-02-01 14:19:00
2 2016-01-28 14:47:00
3 2010-12-14 16:57:00
4 2008-02-08 12:29:00
...
5579 2015-02-26 15:37:59
5580 2011-10-14 16:04:00
5581 2015-10-02 15:07:00
5582 2013-06-28 10:18:00
5583 2009-03-10 16:41:00
Name: DateNotice, Length: 5584, dtype: datetime64[ns]
That got rid of about 60% of the entries! Let's export our processed version of the data.
generic_export('vacant-buildings', vacant_buildings_df, version=1)
This is where we will see what our data is telling us, so that we can
make better judgements on what to look at for interpretation. Let's take
a subset of our crime data, since it's currently difficult to work with
on consumer hardware. Let's take data from the year 2015, which was
the last full year in the data, and then do a stratified random sample,
taking 20% of all data from each month. We will also choose a static
random seed for deterministic results.
import numpy as np
random_seed = 12345
crime_df = generic_load('crime-data')
crime_codes_df = generic_load('crime-codes')
crime_df = crime_df.merge(crime_codes_df[['CODE', 'VIO_PROP_CFS']], how='left', left_on='CrimeCode', right_on='CODE').drop(columns='CODE')
crime_df['crime_severity'] = np.where(crime_df['VIO_PROP_CFS'] == 'VIOLENT', 5, 1)
crime_df = crime_df.drop('VIO_PROP_CFS', axis=1)
crime_df['CrimeMonth'] = crime_df['CrimeDateTime'].dt.month
subset_crime_df = crime_df[crime_df['CrimeDateTime'].dt.year == 2015].groupby(
'CrimeMonth').apply(lambda m: m.sample(frac=0.20, random_state=random_seed)
).reset_index(drop=True)
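The stratified-sampling pattern above can be illustrated on a tiny synthetic frame (the data below is made up):

```python
import pandas as pd

# Synthetic frame: 10 rows in each of two "months".
toy = pd.DataFrame({'month': [1] * 10 + [2] * 10, 'value': range(20)})

# Take a 20% sample from each month, with a fixed seed for
# deterministic results, just like the crime subset above.
toy_sample = (toy.groupby('month')
                 .apply(lambda g: g.sample(frac=0.20, random_state=12345))
                 .reset_index(drop=True))
print(len(toy_sample))  # 4 (2 rows per month)
```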
Let's group the crimes into CSAs using geopandas.sjoin. The function
adds an index_right column with the index value of the intersected row
on the right, but we will drop it since we don't need it.
baltimore_csa_df = generic_load('baltimore-csa')
subset_crime_df = subset_crime_df.set_crs(baltimore_csa_df.crs)
subset_crime_df = gpd.sjoin(subset_crime_df, baltimore_csa_df[['geometry', 'FID', 'Community']]).drop(columns=['index_right'])
Now, let's make a choropleth map showing which regions have the most crime.
crimes_per_csa = subset_crime_df.groupby('FID').apply(lambda group: len(group)).reset_index(drop=True)
baltimore_csa_df['num_crimes'] = crimes_per_csa
There's a lot of boilerplate for making choropleth maps using Plotly, so let's abstract it into a generic function we can use for making many different choropleths.
import plotly.express as px
import geopandas as gpd
# Convert GeoDataFrame to GeoJSON
geojson = loads(baltimore_csa_df.to_json())
def choropleth_baltimore(df, color_col, main_label, **kwargs):
plotly_kwargs = {
'data_frame': df,
'geojson': geojson,
'locations': df.index,
'color': color_col,
'color_continuous_scale': "Reds",
'mapbox_style': "carto-positron",
'zoom': 10,
'center': {"lat": 39.29, "lon": -76.61},
'range_color': [df[color_col].min(), df[color_col].max()],
'opacity': 0.5,
'labels': {color_col: main_label},
'hover_name': 'Community'
}
plotly_kwargs.update(kwargs)
return px.choropleth_mapbox(**plotly_kwargs)
Let's also make a function for drawing a scatterplot trace over a figure.
import plotly.graph_objects as go
def scatter_trace_baltimore(fig, df, hover_col=None, marker_size=5, **kwargs):
plotly_kwargs = {
'lon': df.geometry.x,
'lat': df.geometry.y,
'mode': 'markers',
'marker': {'size': marker_size}
    } | ({'hovertext': df[hover_col]} if hover_col else {})
plotly_kwargs.update(kwargs)
fig.add_trace(go.Scattermapbox(**plotly_kwargs))
Now, let's make a function for plotting crime.
def choropleth_crime(num_crimes_colname='num_crimes', **kwargs):
helper_kwargs = {
'color_col': num_crimes_colname,
'main_label': 'Number of Crimes'
}
helper_kwargs.update(kwargs)
return choropleth_baltimore(baltimore_csa_df, **helper_kwargs)
fig = choropleth_crime()
# Show the graph
fig.show()
These numbers are biased towards larger areas, so let's normalize the values using the normalized size of the area.
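normalize_min_max is a helper defined earlier in the notebook; a plausible minimal definition (an assumption here, shown for illustration) is a simple rescale to the range [0, 1]:

```python
import pandas as pd

def normalize_min_max(series):
    # Rescale so the minimum maps to 0 and the maximum maps to 1.
    return (series - series.min()) / (series.max() - series.min())

s = pd.Series([2.0, 4.0, 6.0, 10.0])
print(normalize_min_max(s).tolist())  # [0.0, 0.25, 0.5, 1.0]
```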
normalized_area = normalize_min_max(baltimore_csa_df.geometry.area)
baltimore_csa_df['num_crimes_normalized'] = normalize_min_max(
baltimore_csa_df['num_crimes'] / (normalized_area + 1))
/tmp/ipykernel_2124824/2427232003.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
choropleth_crime(num_crimes_colname='num_crimes_normalized').show()
Surprisingly, the graph looks exactly the same after normalization.
population_df = generic_load('population-density')
population_df['Community'] = baltimore_csa_df['Community']
choropleth_baltimore(population_df, 'density', 'Population Density (normalized)').show()
At a first glance, there doesn't seem to be a huge correlation between population density and the normalized number of crimes.
median_income_df = generic_load('median-income')
median_income_df['Community'] = baltimore_csa_df['Community']
median_income_df['MedianIncomeNeg'] = -median_income_df['MedianIncome'] + 1
choropleth_baltimore(median_income_df, 'MedianIncomeNeg', 'Inverse Median Income (normalized)').show()
At first glance, there also isn't a huge correlation between median income and crime rate.
In order to analyze liquor stores, bars and strip clubs, let's add this data to our DataFrame.
First, we'll categorize each establishment. Categorizing dataframes will be a common operation so we'll abstract it into a function.
liquor_df = generic_load('liquor-licenses')
def categorize_points_csa(df):
return gpd.sjoin(df, baltimore_csa_df[['geometry', 'FID', 'Community']]).drop(columns=['index_right'])
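To make the spatial join concrete, here is a toy sketch of what categorize_points_csa does, using a synthetic polygon and two synthetic points rather than the real CSA data:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# One synthetic "CSA" polygon and two points, one inside and one outside.
csa = gpd.GeoDataFrame(
    {'FID': [1], 'Community': ['Toy CSA']},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
    crs='EPSG:4326')
points = gpd.GeoDataFrame(
    {'name': ['inside', 'outside']},
    geometry=[Point(1, 1), Point(5, 5)],
    crs='EPSG:4326')

# sjoin keeps only the points that intersect a polygon, tagging each
# with the polygon's attributes (FID, Community).
joined = gpd.sjoin(points, csa[['geometry', 'FID', 'Community']]).drop(columns=['index_right'])
print(joined[['name', 'FID', 'Community']])
```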
liquor_df = categorize_points_csa(liquor_df.set_crs(baltimore_csa_df.crs))
Next, we'll count the total number of each type for each CSA. We'll
make a generic function for this since this will be a common pattern.
This is a function with side effects; it will update
baltimore_csa_df with the new columns.
def add_counts_per_type(df, keys, out_keys) -> None:
global baltimore_csa_df
counts_per_type = df.groupby('FID').apply(lambda group: pd.Series([len(group[group[key]]) for key in keys]) if keys else pd.Series([len(group)]))
counts_per_type.columns = out_keys
counts_per_type.name = 'counts_per_type'
baltimore_csa_df = baltimore_csa_df.merge(counts_per_type, left_on='FID', right_index=True, how='left')
baltimore_csa_df = baltimore_csa_df.fillna({key: 0 for key in counts_per_type.columns})
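As an aside, the per-type counting inside add_counts_per_type can also be expressed with a simpler idiom, since summing a boolean column counts its True values; a toy sketch on made-up data:

```python
import pandas as pd

# Toy establishments tagged by CSA (FID) with boolean type columns.
toy = pd.DataFrame({
    'FID': [1, 1, 1, 2, 2],
    'is_bar': [True, False, True, False, False],
    'is_liquor_store': [False, True, False, True, True],
})

# Summing booleans per group counts the True values in each CSA.
counts = toy.groupby('FID')[['is_bar', 'is_liquor_store']].sum()
print(counts)
```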
add_counts_per_type(liquor_df, ['is_liquor_store', 'is_bar', 'is_strip_club'], ['num_liquor_stores', 'num_bars', 'num_strip_clubs'])
fig = choropleth_baltimore(baltimore_csa_df, 'num_liquor_stores', 'Number of liquor stores')
scatter_trace_baltimore(fig, liquor_df[liquor_df['is_liquor_store']], 'TradeName')
fig.show()
We can see that Southwest Baltimore, the region with the highest crime in the city, has the most liquor stores out of all the regions. This suggests a correlation between liquor stores and crime, but we will test that formally later on.
fig = choropleth_baltimore(baltimore_csa_df, 'num_strip_clubs', 'Number of strip clubs')
scatter_trace_baltimore(fig, liquor_df[liquor_df['is_strip_club']], 'TradeName')
fig.show()
I am baffled by the 14 distinct strip clubs all right next to each other in Seton Hill. I looked up a couple of them and they all seemed to be different places too…
Other than that, it does seem that strip clubs correlate strongly with crime.
fig = choropleth_baltimore(baltimore_csa_df, 'num_bars', 'Number of bars')
scatter_trace_baltimore(fig, liquor_df[liquor_df['is_bar']], 'TradeName')
fig.show()
It seems that the area with the most bars is Fells Point, which was not one of the regions with the highest crime. This suggests that bars, in comparison to liquor stores and strip clubs, do not have a significant effect on crime.
Let's classify the vacant buildings as we've done with the other data.
vacant_df = generic_load('vacant-buildings')
vacant_df = categorize_points_csa(vacant_df)
Next, we'll count the total number of vacant buildings for each CSA.
add_counts_per_type(vacant_df, [], ['num_vacant_buildings'])
fig = choropleth_baltimore(baltimore_csa_df, 'num_vacant_buildings', 'Number of vacant buildings')
scatter_trace_baltimore(fig, vacant_df, 'Address')
fig.show()
As we can see, there are a ton of vacant buildings in Southwest Baltimore, followed by Harlem Park.
Let's count the number of police stations in each CSA.
stations_df = generic_load('police-stations')
stations_df = categorize_points_csa(stations_df)
Next, we'll count the total number of police stations for each CSA.
add_counts_per_type(stations_df, [], ['num_police_stations'])
fig = choropleth_baltimore(baltimore_csa_df, 'num_police_stations', 'Number of police stations')
scatter_trace_baltimore(fig, stations_df, 'address')
fig.show()
Despite having the highest crime rate, Southwest Baltimore does have a police station.
Let's see if there's a correlation between our collected factors.
import statsmodels.formula.api as smf
baltimore_csa_df['MedianIncome'] = median_income_df['MedianIncome']
baltimore_csa_df['Density'] = population_df['density']
model = smf.ols(formula='num_crimes_normalized ~ MedianIncome + Density + num_bars + num_strip_clubs + num_liquor_stores + num_vacant_buildings + num_police_stations', data=baltimore_csa_df).fit()
print(model.summary())
OLS Regression Results
=================================================================================
Dep. Variable: num_crimes_normalized R-squared: 0.593
Model: OLS Adj. R-squared: 0.533
Method: Least Squares F-statistic: 9.793
Date: Fri, 12 May 2023 Prob (F-statistic): 1.85e-07
Time: 13:55:55 Log-Likelihood: 30.523
No. Observations: 55 AIC: -45.05
Df Residuals: 47 BIC: -28.99
Df Model: 7
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept 0.2135 0.057 3.764 0.000 0.099 0.328
MedianIncome -0.0280 0.093 -0.302 0.764 -0.214 0.158
Density -0.0402 0.118 -0.341 0.734 -0.277 0.197
num_bars -0.0011 0.004 -0.303 0.763 -0.008 0.006
num_strip_clubs 0.0204 0.008 2.491 0.016 0.004 0.037
num_liquor_stores 0.0346 0.010 3.439 0.001 0.014 0.055
num_vacant_buildings 0.0002 0.000 1.422 0.162 -9.2e-05 0.001
num_police_stations 0.0111 0.064 0.174 0.863 -0.118 0.140
==============================================================================
Omnibus: 10.485 Durbin-Watson: 1.880
Prob(Omnibus): 0.005 Jarque-Bera (JB): 10.220
Skew: 0.924 Prob(JB): 0.00604
Kurtosis: 4.022 Cond. No. 1.42e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Let's get rid of the factors that seem improbable, namely income, density and police stations.
model = smf.ols(formula='num_crimes_normalized ~ num_bars + num_strip_clubs + num_liquor_stores + num_vacant_buildings', data=baltimore_csa_df).fit()
print(model.summary())
OLS Regression Results
=================================================================================
Dep. Variable: num_crimes_normalized R-squared: 0.568
Model: OLS Adj. R-squared: 0.534
Method: Least Squares F-statistic: 16.75
Date: Fri, 12 May 2023 Prob (F-statistic): 7.95e-09
Time: 13:55:58 Log-Likelihood: 29.849
No. Observations: 56 AIC: -49.70
Df Residuals: 51 BIC: -39.57
Df Model: 4
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept 0.1943 0.035 5.602 0.000 0.125 0.264
num_bars -0.0017 0.003 -0.507 0.615 -0.008 0.005
num_strip_clubs 0.0208 0.008 2.598 0.012 0.005 0.037
num_liquor_stores 0.0351 0.010 3.538 0.001 0.015 0.055
num_vacant_buildings 0.0002 0.000 1.337 0.187 -8.89e-05 0.000
==============================================================================
Omnibus: 8.622 Durbin-Watson: 1.985
Prob(Omnibus): 0.013 Jarque-Bera (JB): 7.894
Skew: 0.840 Prob(JB): 0.0193
Kurtosis: 3.748 Cond. No. 422.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Looking back at the summaries, population density may be weakly related to crime, but we cannot claim a correlation at the 95% confidence level, and median income shows no correlation at all. In this model, the number of bars is also far from significant (p = 0.615), so let's drop it next.
model = smf.ols(formula='num_crimes_normalized ~ num_strip_clubs + num_liquor_stores + num_vacant_buildings', data=baltimore_csa_df).fit()
print(model.summary())
OLS Regression Results
=================================================================================
Dep. Variable: num_crimes_normalized R-squared: 0.566
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 22.57
Date: Fri, 12 May 2023 Prob (F-statistic): 1.71e-09
Time: 13:56:02 Log-Likelihood: 29.708
No. Observations: 56 AIC: -51.42
Df Residuals: 52 BIC: -43.32
Df Model: 3
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
Intercept 0.1958 0.034 5.710 0.000 0.127 0.265
num_strip_clubs 0.0203 0.008 2.570 0.013 0.004 0.036
num_liquor_stores 0.0319 0.008 4.147 0.000 0.016 0.047
num_vacant_buildings 0.0002 0.000 1.685 0.098 -3.89e-05 0.000
==============================================================================
Omnibus: 8.415 Durbin-Watson: 1.981
Prob(Omnibus): 0.015 Jarque-Bera (JB): 7.655
Skew: 0.831 Prob(JB): 0.0218
Kurtosis: 3.718 Cond. No. 419.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Removing vacant buildings makes the R-squared drop even lower, so we will stick with this model. Lastly, let's graph the model's predictions and compare them to the real values.
predicted_crime_values = model.predict(baltimore_csa_df)
baltimore_csa_df['num_crimes_normalized_predicted'] = predicted_crime_values
fig = choropleth_crime(num_crimes_colname='num_crimes_normalized_predicted', main_label='Number of crimes predicted (normalized)')
fig.show()
As we can see, the basic model predicted the peaks correctly, and the rest of the colors were not too far off. To examine the fit more closely, let's plot the residuals.
baltimore_csa_df['num_crimes_residuals'] = baltimore_csa_df['num_crimes_normalized'] - predicted_crime_values
fig = choropleth_crime(num_crimes_colname='num_crimes_residuals', main_label='(Actual - Predicted) for normalized crime rate')
fig.show()
The linear regression model was not very accurate for pure prediction purposes, but it was enough to show a correlation between the variables.
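The accuracy claim above can be made concrete with the usual error metrics. Below is a minimal sketch of how RMSE and R-squared are computed from residuals; the arrays are illustrative made-up values, not the real CSA data.

```python
import numpy as np

# hypothetical actual and predicted normalized crime rates
# (illustrative values only, not the real Baltimore CSA numbers)
actual = np.array([0.30, 0.15, 0.45, 0.25, 0.60])
predicted = np.array([0.28, 0.20, 0.40, 0.30, 0.50])

residuals = actual - predicted
rmse = np.sqrt(np.mean(residuals ** 2))            # typical size of a prediction error
ss_res = np.sum(residuals ** 2)                    # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)     # total variation
r_squared = 1 - ss_res / ss_tot                    # fraction of variance explained
print(round(rmse, 4), round(r_squared, 3))  # 0.0598 0.857
```

Applying the same computation to the real residuals column would give a single number summarizing how far off the choropleth colors are on average.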
To identify crime hotspots, we will use K-Means clustering, iteratively searching for a good value of K. Since K-Means is distance-based, let's first normalize the relevant values.
from scipy.cluster.vq import whiten
subset_crime_df_latlon = pd.DataFrame({
    'lon': subset_crime_df.geometry.x,
    'lat': subset_crime_df.geometry.y
})
normalized_df = whiten(subset_crime_df_latlon)
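As a quick check of what `whiten` actually does: it divides each column by its standard deviation so every feature has unit variance and neither longitude nor latitude dominates the distance computation. A toy example (made-up coordinates, not the real crime data):

```python
import numpy as np
from scipy.cluster.vq import whiten

# toy 2-column data with very different scales
coords = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

scaled = whiten(coords)
# each column is divided by its (population) standard deviation,
# so both columns now have std 1
print(np.allclose(scaled.std(axis=0), 1.0))  # True
```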
Now, let's train the KMeans model, following Tamjid Ahsan's approach from his article Selecting optimal K for K-means clustering.
from sklearn.cluster import KMeans
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
search_range = range(2, 21)
report = {}
trained_models = {}
for k in search_range:
    temp_dict = {}
    kmeans = KMeans(init='k-means++',
                    algorithm='lloyd',
                    n_clusters=k,
                    n_init='auto',
                    max_iter=1000,
                    random_state=1,
                    verbose=0).fit(normalized_df)
    # inertia: sum of squared distances from each point to its cluster center
    inertia = kmeans.inertia_
    temp_dict['Sum of squared error'] = inertia
    try:
        cluster = kmeans.predict(normalized_df)
        chs = metrics.calinski_harabasz_score(normalized_df, cluster)
        ss = metrics.silhouette_score(normalized_df, cluster)
        temp_dict['Calinski Harabasz Score'] = chs
        temp_dict['Silhouette Score'] = ss
        report[k] = temp_dict
        trained_models[k] = cluster
    except Exception as e:
        print(e)
        report[k] = temp_dict
report_df = pd.DataFrame(report).T
report_df.plot(figsize=(15, 10),
               xticks=search_range,
               grid=True,
               title='Selecting optimal "K"',
               subplots=True,
               marker='o',
               sharex=True)
plt.tight_layout()
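The selection logic above can be sanity-checked on synthetic data where the true number of clusters is known. This is a sketch on made-up blobs (not the crime coordinates), assuming the silhouette score should peak at the true K:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# three tight, well-separated synthetic blobs: the true K is 3
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=center, scale=0.1, size=(50, 2))
                    for center in [(0, 0), (5, 5), (0, 5)]])

# score a few candidate values of K by silhouette
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

On the real crime data the curves are far noisier, which is why the elbow in the sum of squared error is used as a second signal alongside the silhouette peak.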
It looks like 6 is the ideal number of clusters: it has the highest Silhouette score, and the squared error begins to flatten out around 6. Let's draw these 6 clusters on the map.
k = 6
print(trained_models[k])
subset_crime_df['cluster'] = trained_models[k]
px.scatter_mapbox(
    subset_crime_df,
    lon=subset_crime_df.geometry.x,
    lat=subset_crime_df.geometry.y,
    color='cluster',
    opacity=0.6,
    mapbox_style='open-street-map',
    zoom=10
).show()
[0 0 0 ... 5 5 5]
The K-Means algorithm gave us 6 clusters of crime, chosen so that the squared distance from each point to its cluster center is minimized. Let's see how many points fall in each cluster:
counts = np.bincount(trained_models[k])
print(counts)
[2820 1056 506 1417 1112 2744]
Based on these counts, the first and last clusters (2,820 and 2,744 points) likely correspond to hotspots. On the map, these two cover the central cross-section of Baltimore, which makes sense: it has the highest proximity to everything else in the city.
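The "largest clusters are the hotspots" reading can be made explicit in code. A small sketch using the counts printed above:

```python
import numpy as np

# cluster sizes reported by np.bincount above
counts = np.array([2820, 1056, 506, 1417, 1112, 2744])

# indices of the two most populous clusters, largest first
hotspots = np.argsort(counts)[::-1][:2]
print(hotspots)  # [0 5]
```

These indices could then be used to filter `subset_crime_df` down to just the hotspot points for further analysis.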
In this project, we analyzed Baltimore crime data and drew connections between different city-level factors and crime. Along with the crime data, we gathered the locations of all the police stations, the population density and median income of the 56 CSAs, and the locations of the liquor stores, bars, strip clubs, and vacant buildings. With more time, we could have explored even more factors, like education level, proximity to parks, and gang activity.
These factors were all carefully chosen after preliminary research, with the assumption that they would naturally bear out. However, after reaching the end of the project, it has become clear that nothing in the real world is as straightforward as theory may lead you to believe. Crime is not something that can easily be modelled with such a small amount of data; it involves countless sociopolitical facts of life, which would take far longer to analyze.
The results of this analysis showed that proximity to liquor stores and strip clubs very likely increases crime, while the rest of the factors were dismissed. However, it's quite possible that some of the dismissed factors do matter, and unfavorable sampling caused them to come up as irrelevant metrics.